library(tidyverse)
library(visdat)
library(ggplot2)
library(tidymodels)
library(corrplot)
library(ggthemes)
library(janitor)
library(naniar)
library(rpart)
library(rpart.plot)
library(mice)
library(tidymodels)
library(themis)
library(naniar)
# install.packages("xgboost")
library(xgboost)
# install.packages("ranger")
library(ranger)
library(discrim)
library(vip)
tidymodels_prefer()
This project was focused on predicting insurance risk levels using the Prudential Life Insurance Assessment dataset, utilizing machine learning models adept at handling high-dimensional, binary classification data. The initial phase involved an exploratory data analysis (EDA), essential for gaining a fundamental understanding of the data set’s characteristics. Following this, data preprocessing was undertaken to ensure optimal model performance. The next step was to fit the data into a selection of models, carefully chosen for evaluation. This process included a detailed assessment of each model’s performance on both the training and test sets. The culmination of this effort was the selection of the best-performing model, which emerged as our final candidate for predicting insurance risk levels effectively.
Introduction
Exploratory Data Analysis (EDA)
Data Splitting and Cross-Validation
Model Fitting
Model Performance and Selection
Conclusion
Imagine being asked by an insurance broker, ‘What is your risk level?’ Such a question might seem unusual, even perplexing, especially from a consumer’s standpoint. Assessing one’s own risk for life insurance purposes is often a challenging task for many, including myself. Determining my own ‘risk level’ in life is not straightforward. This situation underscores a fundamental disconnect in the life insurance industry: consumers, the primary purchasers of life insurance, typically lack the necessary expertise to accurately gauge their risk levels.
A more realistic scenario involves insurance brokers conducting comprehensive risk assessments and then recommending life insurance products based on these evaluations.
My journey in Actuarial Science has provided a foundation in understanding how premiums are determined and the integral role of actuaries. Now, after studying machine learning, I am eager to apply these advanced techniques to the life insurance sector.
My journey in this project begins with a clear yet challenging task: to predict the insurance risk level of new applicants using their anonymized personal and health data. The normalization of this data, essential for privacy, poses a unique challenge due to the lack of descriptive labels for many variables.
Navigating through this anonymized data, I aim to discern patterns that differentiate lower-risk applicants from higher-risk ones. My focus is particularly drawn to the potential influence of health-related factors.
The adventure of this project lies in two key areas. Predictively, I intend to accurately forecast an applicant’s risk level using classification techniques. Inferentially, I’m set to explore which factors most significantly impact these risk levels.
My project is based on the Prudential Life Insurance Assessment data set, sourced from Kaggle. This risk data set, containing 59,381 observations, offers extensive information on insurance applicants. The data set can be accessed at “https://www.kaggle.com/c/prudential-life-insurance-assessment”.The data set is structured with a diverse range of variables, both categorical and numerical. These variables cover essential applicant demographics, detailed medical histories, and various lifestyle indicators. In total, there are 128 predictors in the data set: 13 of these are continuous, 5 are discrete, and the remaining are categorical.
A key element of the Prudential Life Insurance Assessment data set is the pre-cleansing and normalization process, designed to protect the privacy of insurance applicants. Given the data set’s extensive size, comprising 59,381 observations, I’ve chosen to analyze a smaller subset. This decision stems from computational limitations, as processing the entire data set exceeds the capabilities of my personal computer. Furthermore, I’ve adjusted the approach to risk level classification. Instead of working with the original eight risk levels, I’ve consolidated them into two broad categories, creating a binary system that distinguishes between higher and lower risk levels.
This section is an Exploratory Data Analysis of the insurance data set. My objective is to gain a comprehensive understanding of the data’s features and nuances. This process begins with an initial examination followed by addressing any missing values. I will also employ descriptive statistics to analyze the data set in depth. To complement this, visualizations will be used to identify and illustrate key patterns, ensuring that my analysis is closely aligned with the research goals.
# Loading the data
risk_data <- read_csv("risk.csv")
# Cleaning predictor names
risk_data <- clean_names(risk_data)
# Overall percentage of missing data in risk
overall_missing_risk <- mean(is.na(risk_data)) * 100
cat("Overall percentage of missing value in risk:", overall_missing_risk, "%\n")
## Overall percentage of missing value in risk: 5.171885 %
# Identify variables with any missing values
variables_with_missing <- risk_data %>%
summarise_all(funs(any(is.na(.)))) %>%
gather(key = "variable", value = "has_missing") %>%
filter(has_missing) %>%
select(variable)
# Subset the original data set to keep only variables with missing values
risk_with_missing_only <- risk_data %>%
select(one_of(variables_with_missing$variable))
# Plot missing data for the subset
vis_miss(risk_with_missing_only) + coord_flip()
The presence of 5.171885% missing values in the risk data set necessitates a careful approach to ensure the integrity of the analysis. The visualization focusing on variables with missing values is practical, avoiding an overcrowded plot that could obscure the crucial information about the extent of missingness in each predictor.
I opt to eliminate columns with more than 50% missing values, especially if they lack a clear contribution to the model’s predictive power. With an ample data set exceeding 50000 observations, even a 40% absence of data is significant and could skew results.
Consequently, variables like medical_history_32,
medical_history_24, medical_history_15,
medical_history_10, family_hist_5,
family_hist_3, family_hist_2,
insurance_history_5 are removed.
Additionally, the data set contains 48 dummy variables linked to medical keywords. Since these variables are just denote with numbers without additional context or explanation, I’ve decided to remove them to make the data set simpler and more manageable.
Finally, the removal of the ‘id’ column is a logical step, as it serves no predictive function and could potentially introduce noise into the modeling process. By purging these elements, the data set is refined for a more focused and potentially more accurate modeling endeavor.
medical_keywords <- paste0("medical_keyword_", 1:48)
drop <- c("medical_history_32", "medical_history_24", "medical_history_15", "medical_history_10", "id","family_hist_5","family_hist_3", "family_hist_2","insurance_history_5",medical_keywords)
# Drop the columns from the data set
risk_data <- risk_data[,!(names(risk_data) %in% drop)]
miss_case_table(risk_data)
## # A tibble: 6 × 3
## n_miss_in_case n_cases pct_cases
## <int> <int> <dbl>
## 1 0 23346 39.3
## 2 1 27123 45.7
## 3 2 8158 13.7
## 4 3 732 1.23
## 5 4 20 0.0337
## 6 5 2 0.00337
# First, add a new column that counts the number of missing values per case (row)
risk_data <- risk_data %>%
mutate(n_miss_in_case = rowSums(is.na(.))) %>%
filter(!(n_miss_in_case %in% c(3,4,5))) %>%
select(-n_miss_in_case)
miss_var_summary(risk_data)
## # A tibble: 71 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 family_hist_4 18503 31.6
## 2 employment_info_6 10282 17.5
## 3 medical_history_1 8225 14.0
## 4 employment_info_4 6416 10.9
## 5 employment_info_1 13 0.0222
## 6 product_info_1 0 0
## 7 product_info_2 0 0
## 8 product_info_3 0 0
## 9 product_info_4 0 0
## 10 product_info_5 0 0
## # ℹ 61 more rows
The miss_case_table() shows that cases
with 0 to 2 missing values make up 98% of the data, while those with 3,
4, or 5 missing values are less than 2%. Therefore, I’ll remove cases
with 3,4,or 5 missing values to maintain data quality.
risk_data <- risk_data %>% filter(!is.na(employment_info_1))
Since missing values in
employment_info_1 account for less than 1%
of the data, it’s practical to delete these instances directly.
For the other variables with missing values, I will employ imputation
methods during the recipe creation process using
step_impute(). This will preserve as much
data as possible for the analysis. To remind me imputing, I will denote
a list need_impute of missing variables.
need_impute <- c("family_hist_4","employment_info_6","medical_history_1","employment_info_4")
risk_data <- risk_data[1:1123,] %>%
mutate(
# Correctly convert 'response' from factor to numeric
response = ifelse(response >= 6, 1, 0)
)
risk_data$response <- factor(risk_data$response, levels = c(0, 1),labels = c( "Low","High"))
# Converting family_hist_1 to factors
for (i in 1) {
risk_data[[paste("family_hist", i, sep = "_")]] <- as.factor(risk_data[[paste("family_hist", i, sep = "_")]])
}
# Converting employment_info_2, employment_info_3, and employment_info_5 to factors
for (i in c(2, 3, 5)) {
risk_data[[paste("employment_info", i, sep = "_")]] <- as.factor(risk_data[[paste("employment_info", i, sep = "_")]])
}
# Converting all listed insurance_history variables to factors
for (i in c(1, 2, 3, 4, 7, 8, 9)) {
risk_data[[paste("insurance_history", i, sep = "_")]] <- as.factor(risk_data[[paste("insurance_history", i, sep = "_")]])
}
# Converting all listed medical_history variables to factors
for (i in c(2:9, 11:14, 16:23, 25:31, 33:41)) {
risk_data[[paste("medical_history", i, sep = "_")]] <- as.factor(risk_data[[paste("medical_history", i, sep = "_")]])
}
# Converting product_info_1 through product_info_7 to factors
for (i in c(1, 2, 3, 5,6, 7)) {
risk_data[[paste("product_info", i, sep = "_")]] <- as.factor(risk_data[[paste("product_info", i, sep = "_")]])
}
# Converting insuredinfo_1 through insuredinfo_7 to factors
for (i in 1:7) {
risk_data[[paste("insured_info", i, sep = "_")]] <- as.factor(risk_data[[paste("insured_info", i, sep = "_")]])
}
risk_data %>% dim()
## [1] 1123 71
Since the original data is too big to run, I only select part of it from the original data set. I do this after missing values because it would capture more features for the whole data. Here we can see the current dimensions of our data set after addressing the missing data values, selecting part of the data set, and factoring all the variables. We have 1123 observations and 71 variables
Data Dictionary (Predictors’ Meaning):
product_info_1-7: ((1, 2, 3, 5,6, 7 categorical)(4
numeric) A set of normalized variables relating to the product applied
for.
ins_age: (numeric) Normalized age of
applicant.
ht , : Normalized height of applicant.
wt: Normalized weight of applicant.
bmi: (numeric) Normalized BMI of applicant.
employment_info_1-6: (1,4,6 as numeric, 2,3,5 as
catergorical) A set of normalized variables relating to the employment
history of the applicant.
insured_info_1-7: (categorical) A set of normalized
variables providing information about the applicant.
insurance_history_1,2,3,4,6,7,8,9: (categorical) A
set of normalized variables relating to the insurance history of the
applicant.
family_hist_: (1 as categorical, 4 as numeric)
Normalized variables relating to the family history of the
applicant.
medical_history_(2-9,11-14,16-23,25-31,33-41):
(categorical) A set of normalized variables relating to the medical
history of the applicant.
response: This is the target variable, an
ordinal variable relating to the final decision associated with an
application, represened higher risk level when response = 1 and lower
risk level when response = 0.
# Calculate the correlation matrix
risk_complete <- risk_data[, sapply(risk_data, is.numeric)]
risk_complete <- na.omit(risk_complete) # Removes rows with any NA values
cor_matrix <- cor(risk_complete)
# Create a heatmap of the correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45, addCoef.col = "black")
The correlation graph provides valuable insights into the relationships between different variables. Notably, there is a strong positive correlation between weight weight and body mass index BMI, which aligns with the fact that BMI is calculated using weight and height. Given this redundancy, it makes sense to exclude weight from our analysis to avoid collinearity.
Additionally, insurance age ins_age shows a strong
correlation with family_hist_4. Considering that
ins_age is more straightforward to interpret, I will retain
it in the analysis and exclude family_hist_4. This
selective approach in our recipe helps mitigate collinearity and refines
the model for more accurate predictions.
In order to remind me on recipe, I will denote a list contain high correlation that need decide which one to use.
need_decide <- c("wt & bmi", "ins_age & family_history_4")
Before moving into the distribution of risk levels, it’s important to consider a fundamental question that influences the dynamics of life insurance. If you are in good health and your occupation carries no inherent risk, have you contemplated the necessity of life insurance? Conversely, if you find yourself in poor health or a high-risk job, does the prospect of purchasing life insurance seem more pressing? These considerations of personal health and occupational hazards are pivotal, as they can greatly influence an individual’s decision to opt for life insurance, potentially affecting the distribution of risk levels within our data set. Understanding these human factors is crucial as they offer context to the patterns we might observe in the risk classifications.
# Bar chart of the Response variable
ggplot(risk_data, aes(x=as.factor(response))) +
geom_bar(fill="blue") +
ggtitle("Distribution of the risk level") +
xlab("Risk Level") +
ylab("Count")
The bar chart displayed indicates a significant difference in the distribution of risk levels. There are notably more observations classified as ‘High’ risk compared to those deemed ‘Low’ risk. Specifically, the ‘Low’ risk category contains less than one-third of the number of observations in the ‘High’ risk category. This could reflect various factors, including the possibility that individuals with higher risk factors are more inclined to apply for life insurance, or it may suggest that the criteria used to define ‘High’ risk are more commonly met. This can also represent a slight imbalance in the levels of outcome which may need consider step_upsample in the recipe.
# Visualizing the distribution of 'response' across different categories of 'product_info_2'
ggplot(risk_data, aes(x = ins_age, fill = as.factor(response))) +
geom_bar(position = "dodge") +
labs(title = "Distribution of Response by Insurance Age", x = "ins_age", y = "Count") +
theme_minimal()
The histogram of the risk level distribution categorized by
ins_age, even after normalization, shows what appears to be
a bimodal distribution. The peaks around 0.25 and 0.6 suggest that there
are two age groups where both low and high response frequencies are
notably higher. This pattern might indicate that certain age ranges are
more prevalent or considered more significant when determining risk
levels for life insurance. Such a distribution could reflect different
life stages or insurance needs, which would be a relevant factor in the
modeling and analysis of insurance risk.
# Visualizing the distribution of 'response' across different categories of 'product_info_2'
# Visualizing the distribution of 'response' with respect to 'ins_age' and 'bpi'
ggplot(risk_data, aes(x = wt, y = bmi, color = as.factor(response))) +
geom_point(alpha = 0.6) +
labs(title = "Distribution of Response with Weight and BMI",
x = "Weight", y = "BMI", color = "Response") +
theme_minimal()
The scatter plot provided illustrates the distribution of risk levels about weight wt and body mass index BMI. It reinforces the strong correlation between these two factors. The data points form distinct clusters by response level, with observable trends in both weight and BMI dimensions. This suggests that there may be a pattern where certain ranges of weight and BMI are more associated with either a high or low-risk classification. The visual representation of colinearity between weight and BMI within each risk response group supports the decision to focus on one variable to avoid redundancy in the predictive modeling.
# Combine the plots into a single graph with facet_wrap
risk_data_long <- risk_data %>%
select(product_info_1, product_info_5, product_info_6, product_info_7) %>%
pivot_longer(cols = everything(), names_to = "product_info", values_to = "value")
ggplot(risk_data_long, aes(x=as.factor(value))) +
geom_bar(fill="blue") +
facet_wrap(~product_info, scales = "free_x") +
ggtitle("Distribution of Risk Level across Product Information") +
xlab("Risk Level") +
ylab("Count")
The distribution of the product_info_ variables in the
data set shows a pronounced skew towards one category, suggesting
limited variability. To refine the model and enhance its predictive
accuracy, I will use step_nzv(all_predictors()) in the
recipe to improve these near-zero variance predictors, thereby reducing
noise and focusing on more informative variables.
which(apply(risk_data, 2, var) == 0)
## medical_history_35
## 64
ggplot(risk_data,
aes(x=as.factor(medical_history_35))) +
geom_bar(fill="blue") +
ggtitle("Distribution of the medical history 35") +
xlab("Risk Level") +
ylab("Count")
When I did SVM model, there is a code that is not working and saying
that some one my column is not defied. After researching on the bug, I
find out that using only step_nzv() is not enough to remove
low variance predictor, since it does not remove the predictor with zero
variance.
Having addressed missing values, and conducted visual exploratory data analysis, the next step is to prepare the data set for the modeling process. This involves splitting the data into training and testing subsets, which will enable the evaluation of model performance on unseen data. Following the split, I will implement k-fold cross-validation on the training set. This technique not only reduces the variance of the performance estimate but also maximizes the use of available data for training. In the end, I will create a recipe ready for fitting in various models.
Since I do no have very large training data set, 75% would be my best choice for mentioning more data for training.
set.seed(1123)
risk_split <- risk_data %>%
initial_split(strata = response, prop = 0.75)
risk_train <- training(risk_split)
risk_test <- testing(risk_split)
# Dimensions of both train and test set on risk data
dim_train <- dim(risk_train)
cat("Dimensions of our risk training data set:", dim_train[1],"x",dim_train[2])
## Dimensions of our risk training data set: 841 x 71
dim_test <- dim(risk_test)
cat("Dimensions of our risk testing data set:", dim_test[1],"x",dim_test[2])
## Dimensions of our risk testing data set: 282 x 71
Considering the modest size of the training set and the need for each fold to be representative, I’ve decided to proceed with 5-fold cross-validation. This approach balances the need for a sufficient number of observations in each training fold while still reserving a substantial portion for validation.
risk_fold <- vfold_cv(risk_train, strata = response, v = 5)
need_decide
## [1] "wt & bmi" "ins_age & family_history_4"
need_impute
## [1] "family_hist_4" "employment_info_6" "medical_history_1"
## [4] "employment_info_4"
For the two sets of variables, prioritizing
bmi and
ins_age for inclusion in your recipe is
sensible due to their clear definitions and likely significance in
determining insurance risk.
Regarding the four variables requiring imputation, it’s important to consider the nature of each variable when selecting an imputation method. Since they all continuous variables, mean or median imputation could be appropriate.
Since my original data have been already normalized, additional
normalization may not be necessary. That why I did not center and scale
my predictors in the recipe. Additionally, the Distribution of Risk
Level across Product Information in 2 section indicate the need of
step_nzv(all_predictors(). Also not using
step_upsample() is that I want make use of random forest
whicg is very good at work with imblanced model.
Lastly, I also remove medical_history_35 from my recipe
due to zero variance of its values.
In the end, I will remove wt,
family_hist_4, and step_impute_mean()
employment_info_6, medical_history_1,
employment_info_4, and medical_history_35.
# Start building a recipe for the risk_train data set
risk_recipe <- recipe(response ~ ., data = risk_train) %>%
step_rm(c(wt,family_hist_4,medical_history_35)) %>%
# Impute missing values in 'employment_info_6','employment_info_4', "medical_history_1"using mean imputation.
step_impute_mean(employment_info_6,employment_info_4,medical_history_1) %>%
# remove near zero variance predictors
step_nzv(all_predictors()) %>%
# Convert all nominal predictors into dummy variables.
step_dummy(all_nominal_predictors())%>%
step_upsample(response,over_ratio = 1)
head(prep(risk_recipe) %>% bake(risk_train))
To make the rest of work done efficiently, I decided to save
risk_fold, risk_recipe,
risk_trian, risk_test in the file named
"risk_analysis_data.RData"
save(risk_fold, risk_recipe, risk_train, risk_test, file = "risk_analysis_data.RData")
Logistic Regression: Simple and effective, assuming a linear relationship between predictors and the binary outcome. Ideal for baseline modeling and interpretation.
Elastic Net Regression: A regularized regression method that linearly combines the L1 and L2 penalties of the Lasso and Ridge methods, suitable for dealing with multicollinearity and feature selection.
Random Forest: Highly suitable for your data with many predictors and a binary outcome. Excels in handling non-linear relationships and diverse data types.
Gradient Boosting Trees: Versatile for binary and multi-class outcomes, known for their predictive power and ability to handle various data complexities.
Support Vector Machine: Particularly effective for high-dimensional data. Great for both linear and non-linear decision boundaries, offering robustness in binary classification tasks.
Given the high-dimensional and binary classification nature of risk_data, Gradient Boosting machine, SVM, and Random Forest are strong contenders. Their ability to handle non-linearity and complex patterns makes them particularly promising. However, the true effectiveness of each model can only be determined after evaluating their performance on your specific data set. This will provide clearer insights into whether a linear or non-linear model is more appropriate.
I’ll be applying five different classification methods. For
evaluating and comparing these models, I’ll use the ROC_AUC
metric. It’s effective for comparing two models and assessing the same
model across various thresholds. Although our data isn’t extremely
imbalanced, it shows some trends, and ROC_AUC is known to
perform well with imbalanced data sets, unlike AUC.
Therefore, I’ve chosen ROC_AUCas the primary metric for
comparing models both within and between methods.
Setup: Define the model as logistic regression,
set the engine, and mode to classification.
Workflow: Create the workflow using the LOG model and risk_recipe.
Model Fitting: Fit the model to the resampling object (risk_fold)
Evaluation and Saving: Evaluate the model’s performance and save the results to an RDA file.
Setup: Define the model, set the engine, and mode to ‘classification’.
Workflow: Create a workflow, add the model, and integrate data preprocessing steps.
Hyperparameter Tuning: Set up a tuning grid specific to the model; conduct hyper-parameter tuning.
Model Selection and Fitting: Select the best
model from the tuning grid, finalize the workflow, and fit it to
(risk_fold). The best model would be the one have highest
ROC_AUC value.
Evaluation and Saving: Evaluate the model’s performance using metrics = ROC_AUC and save the results to an RDA file.
Here is every methods’ result RDA file.
load("/Users/zejiegao/Desktop/PSTAT231/Predential life Insurance/Sandy's Final Project/risk_analysis_data.RData")
load("/Users/zejiegao/Desktop/PSTAT231/Predential life Insurance/Sandy's Final Project/final_log_fit.rda")
load("/Users/zejiegao/Desktop/PSTAT231/Predential life Insurance/Sandy's Final Project/final_enr_fit.rda")
load("/Users/zejiegao/Desktop/PSTAT231/Predential life Insurance/Sandy's Final Project/final_rf_fit.rda")
load("/Users/zejiegao/Desktop/PSTAT231/Predential life Insurance/Sandy's Final Project/final_bt_fit.rda")
load("/Users/zejiegao/Desktop/PSTAT231/Predential life Insurance/Sandy's Final Project/svm_final_radial_fit.rda")
After fitting all models and identifying the best ones, I’m set to
analyze their performance, focusing on Gradient Boosting Tree, SVM, and
Random Forest due to their suitability for high-dimensional, binary data
in risk_data. I’ll begin by visually presenting ROC_AUC for
these methods, then compare them on the training set. Finally, I’ll
apply the selected models to the test data (risk_test) to evaluate their
true performance. This approach is crucial as models typically perform
well on training data, often showing high ROC_AUC. Testing
them on new data will reveal their actual effectiveness.
In our journey through Random Forest modeling, we’ve seen firsthand
the impact of our chosen hyper-parameters on prediction accuracy. The
number of treestrees, the minimum size of the tree
nodesmin_n, and the number of variables mtry
we consider at each decision point—these elements are key to our model’s
success.
More trees in the forest typically lead to better predictions, as indicated by higher ROC AUC values on our charts. Likewise, giving each node a bigger pool of cases before splitting helps improve the model’s accuracy. An optimal model may be roughly 10 trees, where each node begins with a minimum of 10 instances.
autoplot(tune_rf_fit) + theme_minimal()
Now let’s look at Gradient Boosting Machines, which are really good at figuring out hard problems. We’re focusing on something called the learning rate, which is like telling the model how fast to learn. If the learning rate is too slow, our model takes a really long time to get better. But if we set it just right, our model learns quickly and does a great job.
The picture tells us about trying different learning speeds and how many trees the model uses to learn. If the learning speed is too slow, the model hardly learns anything. But if we set the learning speed to a good level, our model starts to learn better and faster. It’s like each tree gives it a new idea, and the more trees it has, the more it knows. We want to find the best setup where our model can make really smart choices but not get mixed up. And it seems like setting the learning speed to about 0.1 and using 700 trees is the best way to do that.
autoplot(tune_bt_fit) + theme_minimal()
5.3 Support Vector Machine (SVM)
Moving on to Support Vector Machines, the chart shows us how well the SVM does this job as we change the cost value. The cost helps decide if we want our model to be really strict about getting the training data right, which might make it less flexible, or if we want it to be okay with some mistakes but work better on new data we give it later.
When we look at the chart, we see that when we turn up the cost, the model gets better at predicting up to a certain point. But if the cost gets too high, things start to go down, and the model doesn’t do as well. It looks like the best setting for the cost is around 2 because that’s where the model does its best job without getting too complicated.
svm_rbf_res %>% autoplot()
To comprehensively represent this data, I create a tibble showcasing
the ROC_AUC values for the best models from each method,
providing a clear comparison of their performance. Attention, this is
from risk_train data.
roc_auc_log <- augment(final_log_fit, new_data = risk_train) %>% roc_auc(truth = response, .pred_Low)
roc_auc_rf <- augment(final_rf_fit, new_data = risk_train) %>% roc_auc(truth = response, .pred_Low)
roc_auc_bt <- augment(final_bt_fit, new_data = risk_train) %>% roc_auc(truth = response, .pred_Low)
roc_auc_svm <- augment(svm_final_radial_fit, new_data = risk_train) %>% roc_auc(truth = response, .pred_Low)
roc_auc_enr <- augment(en_final_risk, new_data = risk_train) %>% roc_auc(truth = response, .pred_Low)
# Create a tibble with model names and their corresponding ROC AUC estimates
roc_auc_tibble <- tibble(
Model = c("Logistic Regression", "Elastic Net regression", "Random Forest", "Boosted Trees", "SVM"),
ROC_AUC_train = c(roc_auc_log$.estimate, roc_auc_enr$.estimate, roc_auc_rf$.estimate, roc_auc_bt$.estimate, roc_auc_svm$.estimate)
)
# Arrange the tibble in descending order based on ROC AUC
roc_auc_tibble_ordered <- roc_auc_tibble %>%
arrange(desc(ROC_AUC_train))
print(roc_auc_tibble_ordered)
## # A tibble: 5 × 2
## Model ROC_AUC_train
## <chr> <dbl>
## 1 Random Forest 0.957
## 2 Boosted Trees 0.940
## 3 SVM 0.933
## 4 Logistic Regression 0.903
## 5 Elastic Net regression 0.864
In assessing the capabilities of various models using the risk_train data, I’ve organized a table to display the ROC_AUC values. The Random Forest model stands out, securing the top position with the highest ROC_AUC, reflecting its excellent predictive prowess. Following it are the Boosted Trees, displaying strong performance as well. Next in line is the SVM model. Consistent with previous observations, these three models prove to be particularly adept at navigating the complexities of high-dimensional binary classification.
show_best(tune_rf_fit, n = 1)
## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 6 10 10 roc_auc binary 0.763 5 0.0187 Preprocessor1_Model1…
Preprocessor1 Model 124 is our best random forest model !
When evaluating our model’s effectiveness on new data, it’s crucial to look beyond its success on the training set to how it performs when confronted with unseen cases. The model’s ability to generalize is captured in the ROC_AUC score on the risk_test sample, which sits at 0.780756. This is a commendable score, particularly as it falls between 70% and 80%.
The ROC curve, while not perfect—its highest point doesn’t quite reach the top left corner—still demonstrates a quality level of classification. It indicates that our model is discriminating between the classes effectively.
Further insights are gleaned from the confusion matrix. Here, the accuracy is derived by the sum of correct predictions (where prediction matches the truth) divided by the total number of cases tested. The model achieves approximately 73% accuracy, which reinforces the model’s solid performance. Notably, the model is particularly adept at identifying high-risk levels, which aligns with the composition of the training data that had a higher representation of high-risk cases.
roc_auc_rf_risk <- augment(final_rf_fit, new_data = risk_test) %>%
roc_auc(truth = response, .pred_Low) %>%
select(.estimate)
roc_auc_rf_risk
## # A tibble: 1 × 1
## .estimate
## <dbl>
## 1 0.781
risk_rf_roc_curve <- augment(final_rf_fit, new_data = risk_test, type = 'prob') %>%
roc_curve(response, .pred_Low) %>%
autoplot()
risk_rf_roc_curve
# Generating class predictions
final_rf_predictions <- augment(final_rf_fit, new_data = risk_test, type = 'class')
# Creating a confusion matrix
risk_conf_matrix <- conf_mat(final_rf_predictions, truth = response, estimate = .pred_class) %>%
autoplot(type = "heatmap")
risk_conf_matrix
show_best(tune_bt_fit, n = 1)
## # A tibble: 1 × 9
## mtry trees learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 700 0.1 roc_auc binary 0.813 5 0.0119 Preprocessor1_M…
Preprocessor1 Model 50 is the best!
In addition to exploring the Random Forest model, I have applied a Gradient Boosting Tree (GBT) model to my final test, intrigued by its ability to highlight the influence of each predictor on the outcome. The performance on the test data is promising—the GBT model achieves a ROC_AUC score of 0.7947596, which is even higher than that of the Random Forest. This could be attributed to GBT’s proficiency in handling large data sets and its focus on optimizing for high-value predictions, potentially mitigating issues that Random Forest models may encounter with larger amounts of data.
The GBT model’s ROC curve and confusion matrix affirm its effectiveness in distinguishing between high and low risk categories. By using the vip() function, BMI stands out as the most influential predictor. This is just like my expectation as BMI provides insights into a person’s health, implying potential risks such as heart conditions or diabetes. Other important factors include product_info_4 and employment_info_1, indicating their relevance in risk assessment. Contrary to expectations, insurance age (ins_age) did not carry as much influence, which was a surprising find. ]=
The accuracy, based on the confusion matrix, is calculated by adding the number of correct predictions for both high risk (159) and low risk (53) and dividing by the total number of cases, up to 75%.
roc_auc_bt_risk <- augment(final_bt_fit, new_data = risk_test) %>%
roc_auc(truth = response, .pred_Low) %>%
select(.estimate)
roc_auc_bt_risk
## # A tibble: 1 × 1
## .estimate
## <dbl>
## 1 0.795
risk_svm_roc_curve <- augment(final_bt_fit, new_data = risk_test, type = 'prob') %>%
roc_curve(response, .pred_Low) %>%
autoplot()
risk_svm_roc_curve
# Generating class predictions
final_bt_predictions <- augment(final_bt_fit, new_data = risk_test, type = 'class')
# Creating a confusion matrix
risk_conf_matrix <- conf_mat(final_bt_predictions, truth = response, estimate = .pred_class) %>%
autoplot(type = "heatmap")
risk_conf_matrix
final_bt_fit %>% extract_fit_parsnip() %>%
vip() +
theme_minimal()
Taking into account that the GBT model outperforms RF in terms of both ROC_AUC and accuracy, I conclude that Preprocessor1 Model 50 is my final selection.
Throughout this project, the primary objective was to harness machine learning techniques to predict insurance risk levels using normalized personal and health data. After a thorough exploration of various predictive models, the Gradient Boosting Tree (GBT) model, labeled as Preprocessor1 Model 50, stood out with superior performance, achieving a ROC_AUC score of 0.7947596 and an accuracy of approximately 75% on the risk_test sample. This model’s success is attributed to its capacity to handle the high-dimensional nature of the data set and to effectively identify the most influential predictors, such as BMI, which is a crucial indicator of health-related risks.
The journey began with a comprehensive exploratory data analysis, followed by data preprocessing, which included addressing missing values and refining the data set for optimal modeling. The correlation analysis, as part of the EDA, guided the selection of the most relevant variables and shaped the subsequent data preparation steps. The models were then evaluated, with the GBT model ultimately proving to be the most effective in terms of both prediction accuracy and its ability to generalize to new data.
I’m really thrilled with how this project turned out, especially getting to dive into real-world problems with machine learning. Still, I learned heaps about each model – their strengths, their weaknesses, all of it.
I would also like to thank Dr. Katie Coburn for her invaluable guidance and support throughout the writing of this project.